---
layout: post
title: Can Machine Learning assist with Coronary Artery Disease diagnosis?
description: .
tags:
- Logistic Function
- Logistic Regression
- Machine Learning
- Cross-Entropy
- Classification
- Gradient Descent
- Neural Networks
- Notebook
---

Can Machine Learning assist with Coronary Artery Disease diagnosis?

Data Dictionary

Using CRISP-DM to improve medical decision making

Introduction

This blog analyses a business problem by following the CRISP-DM process, as described by the following process flow:

CRISP-DM

The CRISP-DM (CRoss-Industry Standard Process for Data Mining) methodology provides essential support for those seeking to understand and practise data mining and data science. It is a business-focused methodology that places the emphasis of the analysis on what matters, i.e. the business problem and the data rather than the technique. More on the CRISP-DM process can be found here:

CRISP-DM

One of the authors of CRISP-DM, Tom Khabaza, has been practising data science for decades and helped draft the methodology in 1999. It has been used by many since and is well worth studying in depth, as it provides many pearls of wisdom on data science and business analysis.

Although this blog outlines the business analysis and does not focus on code, the code for the analysis can be accessed by expanding the code sections in the blog (in truth it is never possible to separate the two completely, so the code has been retained for reference purposes).

Section 1: Business Understanding

The business area of concern for this analysis is the diagnosis of Coronary Artery Disease by medical practitioners. Currently, the Gold Standard for diagnosis is Angiography. The problem we are investigating is that in many settings too many Angiograms are being performed, which could result in poorer patient care and outcomes.

Coronary Artery Disease (CAD) is a disease in which there is a narrowing or blockage of the coronary arteries (blood vessels that carry blood and oxygen to the heart). Coronary heart disease is usually caused by atherosclerosis (a buildup of fatty material and plaque inside the coronary arteries).


We attempt to answer the following business questions by performing this analysis.

Question 1

  • Can data science be used to improve the diagnosis of Coronary Artery Disease by means of using existing data sources e.g. by using predictive modelling instead of Angiography?

Question 2

  • If predictive modelling cannot replace Angiography, can data science be used to reduce the number of Angiograms performed in settings where this is problematic?

Question 3

  • What are the 4 factors most highly correlated with CAD within our dataset?

We will attempt to answer these questions by interrogating data available to us.

Section 2: Data Understanding

For this analysis we use the Cleveland "Coronary Artery Disease" dataset found on the UCI Machine Learning Repository at the following location:

Heart Disease Dataset

The dataset covers 303 patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984. The independent variables (routine, test and demographic data) are all available before an Angiogram takes place.

Approach

As a starting hypothesis, first prize would be to create a predictive model for Coronary Artery Disease based on information available prior to an Angiogram taking place. Second prize would be to identify factors associated with Coronary Artery Disease, as indicated by a coronary angiography interpreted by a Cardiologist, which could be used to create a clinical algorithm to decide whether an Angiogram is indicated, again based on data available prior to an Angiogram taking place.

For our predictive model, the dependent (response) variable is the angiographic test result: a diameter narrowing of more than 50% indicates Coronary Artery Disease.
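
As a minimal sketch, the binary response could be derived as follows. It assumes the raw angiographic field is coded from 0 (less than 50% narrowing) to 4, as described in the UCI data dictionary; the function name `binarise_target` and the sample values are hypothetical.

```python
def binarise_target(raw_values):
    """Map the raw angiographic status to a binary outcome.

    Assumption: 0 means < 50% diameter narrowing (no CAD) and any
    value from 1 to 4 means > 50% diameter narrowing (CAD present).
    """
    return [1 if value > 0 else 0 for value in raw_values]


# Hypothetical raw values for six patients (not taken from the dataset).
raw_status = [0, 2, 0, 1, 4, 0]
has_cad = binarise_target(raw_status)
print(has_cad)
```

Collapsing the graded field into two classes keeps the problem a plain binary classification, which is what the models in the later sections expect.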

We will apply various Machine Learning techniques and measure model classification accuracy. Furthermore, we will analyse the significance of the individual features, using several techniques to rank feature importance.

Section 3: Data Preparation and Exploration

The distribution of positive and negative values is balanced, with 46% of values denoting a positive outcome, i.e. having Coronary Artery Disease (CAD). This is advantageous because it simplifies the predictive modelling step. The sample size is however very small (303 patients), so we will have to take this into consideration when analysing our final results.

Let us consider the categorical variables.


It is interesting to note that there are roughly 30% females and 70% males. The frequency of chest pain types also seems to increase in a roughly linear fashion for this population, with the largest portion of the population suffering from severe chest pain.

Next, let us consider the continuous variables. We start by looking at age patterns.


It is clear that individuals suffering from coronary artery disease have a higher average age.


The violin plots demonstrate that the distributions for age, maximum HR and ST depression differ between individuals with and without CAD, whereas there is little difference in the distributions for resting BP and cholesterol.

The violin plot for age demonstrates that the ages of individuals without CAD are evenly spread between 40 and 65, with some younger patients below the age of 30, whereas individuals with CAD are mostly older, with a median age of approximately 60 and few, if any, below 30 years of age.

The median maximum HR for individuals without CAD is higher (~160) than for individuals with CAD (~150), with a narrower distribution around the median. Individuals with CAD have a distribution skewed towards lower maximum HR, with a larger proportion having a maximum HR below 100 than healthy individuals.

The distribution for ST depression is starkly different. Individuals without CAD have a median ST depression of 0, with a narrow distribution around the median and a small proportion having ST depression between 1 and 2.

In contrast, individuals with CAD follow a broader distribution around a median of ~1.5, with a substantial proportion having ST depression >2. Resting blood pressure and cholesterol do not appear to differ materially between patients with and without CAD: both groups have a similar median resting BP (around 125 mmHg) and cholesterol (200-250), with a roughly even spread around the point estimates. A small number of individuals with CAD have a much higher resting BP of >200, whereas none of those without CAD have a resting BP >200. However, this may not be statistically significant. Interestingly, some individuals without CAD have very high cholesterol (500-600).


These observations are similar to those made for the density and violin plots. We are dealing with an older population, with an average age of 54 years. There are a few outliers for high resting blood pressure, with the distribution showing a slight skew to the right. The same holds for cholesterol and st_depression, which show even higher skewness. Conversely, max_heart_rate has outliers to the left and a slight skew to the left. This makes sense, as higher values for the former variables could indicate poorer health, whereas lower values for max_heart_rate could indicate poorer health, as observed in the violin plots.

The feature variables have varying scales, so standardisation would be required for ML purposes. For regression, normalisation might improve outcomes (for this investigation we will however not perform normalisation). Investigating the outliers is recommended, as it might reveal interesting facts, and addressing them could improve model performance.
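
To illustrate what standardisation does to variables on different scales, here is a small sketch using NumPy. The helper name `standardise` and the four sample rows are hypothetical; the rows are only loosely shaped like age, cholesterol and maximum heart rate readings.

```python
import numpy as np


def standardise(columns):
    """Return z-scores: subtract each column's mean and divide by its
    (population) standard deviation."""
    columns = np.asarray(columns, dtype=float)
    mean = columns.mean(axis=0)
    std = columns.std(axis=0)
    return (columns - mean) / std


# Hypothetical rows of (age, cholesterol, max_heart_rate).
sample = np.array([
    [63, 233, 150],
    [37, 250, 187],
    [41, 204, 172],
    [56, 236, 178],
])
scaled = standardise(sample)
print(scaled.round(2))
```

After this step every column has mean 0 and standard deviation 1, so no variable dominates a gradient-based optimiser purely because of its units.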

Section 4: Modeling

Our first objective is to obtain a baseline measure of the strength of association between all the variables and the outcome. This would help us to decide whether predictive modelling is a viable solution in this setting at all.

For this, we will build a basic Logistic Regression model, without transforming or scaling any of the variables.


We now build and test a naive logistic regression model - without any transformations or optimisations.
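
The following is a hedged sketch of such a naive baseline using scikit-learn. The data is synthetic (generated here as a stand-in for two unscaled predictors, not the Cleveland data), and the split ratio and random seeds are assumptions, so the printed AUC will not match the figure reported in this blog.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 303

# Synthetic, unscaled stand-ins for two predictors.
max_heart_rate = rng.normal(150, 23, n)
st_depression = np.abs(rng.normal(1.0, 1.1, n))
X = np.column_stack([max_heart_rate, st_depression])

# Synthetic outcome: risk rises with ST depression, falls with max heart rate.
linear = -0.04 * (max_heart_rate - 150) + 0.9 * (st_depression - 1.0)
y = (rng.random(n) < 1 / (1 + np.exp(-linear))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# No scaling or transformation: the raw columns go straight into the model.
model = LogisticRegression(max_iter=10_000)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.2f}")
```

The AUC is computed on the held-out split from predicted probabilities rather than hard class labels, which is what makes it a ranking measure (C-statistic).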

AUC: 0.87
Normalized confusion matrix

As can be seen from the accuracy measurements, the baseline model performs very well. A C-statistic (AUC) of 0.87 on data that has not been scaled or transformed is a very good result. It confirms our EDA findings, which showed an association between the outcome variable and both age and the other continuous variables. Based on this result, the association is evidently strong for at least a few of the variables. Although this is a good result, it is still not accurate enough to replace Angiography, which as the Gold Standard is treated as approximately 100% accurate.
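
For readers unfamiliar with the normalised confusion matrix shown above, this sketch shows one way to compute it with scikit-learn's `confusion_matrix` and its `normalize="true"` option. The ten labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (1 = CAD, 0 = no CAD).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# normalize="true" divides each row by the number of true cases in that
# class, so every row sums to 1.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)

sensitivity = cm[1, 1]  # share of true CAD cases that were detected
specificity = cm[0, 0]  # share of non-CAD cases correctly ruled out
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Row normalisation is useful here because sensitivity and specificity can be read directly off the diagonal, independent of how many patients fall in each class.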

We will now investigate this correlation further, but first, we will transform and scale the variables to see whether we can improve the accuracy obtained by the naive modelling approach.

Improve on Logistic Regression

We now scale and transform variables to obtain a very basic improvement on the naive model. We will not perform extensive feature engineering or advanced hyperparameter tuning at this stage.
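
One plausible reading of "scale and transform" is to standardise the continuous columns and one-hot encode the categorical ones. The sketch below does this with a scikit-learn `ColumnTransformer`; the choice of encoder, the column grouping and the synthetic data frame are all assumptions rather than the steps actually used in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in data using three of the blog's column names.
df = pd.DataFrame({
    "age": rng.normal(54, 9, n),
    "max_heart_rate": rng.normal(150, 23, n),
    "chest_pain_type": rng.integers(1, 5, n),  # categories 1..4
})
y = (df["max_heart_rate"] + rng.normal(0, 20, n) < 150).astype(int)

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["age", "max_heart_rate"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["chest_pain_type"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipeline.fit(df, y)

# Two scaled columns plus one indicator column per chest pain category.
transformed = pipeline.named_steps["preprocess"].transform(df)
print(transformed.shape)
```

Wrapping the preprocessing and the model in one pipeline means the scaler is fitted only on training data whenever the pipeline is cross-validated, which avoids leaking information from the test folds.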

AUC: 0.88
Normalized confusion matrix

After scaling and transforming the data, we observe a modest improvement in the accuracy of the model. Although the gain in accuracy alone probably does not warrant the transformation and scaling, convergence has improved by an order of magnitude: the number of iterations required before convergence was previously more than 1,000,000, and after scaling it dropped to fewer than 100,000. This is an encouraging result, as it shows the model now captures the signal in the data without excessive computation, which will allow us to use more complex models to improve accuracy.
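
The effect of scaling on convergence can be inspected through the `n_iter_` attribute of a fitted scikit-learn `LogisticRegression`. This sketch uses synthetic data and the default solver, both assumptions; the solver behind the iteration counts quoted above is not stated, so the numbers printed here are not comparable with them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 303

# Two synthetic columns on very different scales.
X = np.column_stack([
    rng.normal(246, 52, n),   # cholesterol-like values
    rng.normal(1.0, 1.1, n),  # ST-depression-like values
])
signal = 0.9 * (X[:, 1] - 1.0)
y = (rng.random(n) < 1 / (1 + np.exp(-signal))).astype(int)

# Same model, fitted once on raw columns and once behind a scaler.
raw_model = LogisticRegression(max_iter=100_000).fit(X, y)
scaled_pipeline = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=100_000)
).fit(X, y)

raw_iterations = int(raw_model.n_iter_[0])
scaled_iterations = int(scaled_pipeline[-1].n_iter_[0])
print(f"iterations without scaling: {raw_iterations}")
print(f"iterations with scaling:    {scaled_iterations}")
```

How large the difference is depends heavily on the solver and the data, so it is worth checking on the real dataset rather than assuming a fixed speed-up.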

With a C-statistic of 0.88 we are probably not accurate enough yet to replace the Gold Standard. Therefore, we now turn our attention to studying variable correlations.

Improve further on Logistic Regression

We now perform feature selection to ascertain whether a smaller, parsimonious model could be built with fewer variables. As noted by Detrano et al. (1989), this could be useful from a practical perspective: not all healthcare settings have all the variables at their disposal, which would otherwise necessitate deploying several complex predictive models, and that is not practical operationally.

We will first perform correlation and regression tests on the data. These tests are best performed by considering continuous and categorical variables separately, due to the intrinsic difference in regression coefficient values for these variables. We will then apply a few numerical methods to the full dataset and compare results.

We start by considering the continuous variables.


We see that there is a very strong inverse correlation between maximum heart rate and age. This makes sense, as one's maximum heart rate typically decreases with age. Similarly, there is a strong inverse correlation between max_heart_rate and st_depression. This makes sense, as a lower max_heart_rate is likely to indicate poorer health and could therefore be correlated with a greater st_depression.

We also see a correlation between maximum heart rate and both cholesterol and resting blood pressure. High blood pressure and cholesterol are typically indications of poorer health, which would be consistent with a lower maximum heart rate.

Another observation of interest is the strong correlation between cholesterol and age. These variables could make strong combined predictors for a next iteration of the model.
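
A pairwise Pearson correlation matrix of the kind discussed above can be sketched with NumPy as follows. The three columns are synthetic: maximum heart rate is generated to decline with age by construction, and cholesterol is independent noise, so the values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 303

age = rng.normal(54, 9, n)
# Synthetic assumption: max heart rate declines with age, plus noise.
max_heart_rate = 220 - 1.0 * age + rng.normal(0, 15, n)
cholesterol = rng.normal(246, 52, n)

names = ["age", "max_heart_rate", "cholesterol"]
# np.corrcoef treats each row as one variable.
corr = np.corrcoef(np.vstack([age, max_heart_rate, cholesterol]))

for name, row in zip(names, corr):
    print(f"{name:>15}", np.round(row, 2))
```

Reading the sign of each off-diagonal entry is what distinguishes a direct from an inverse relationship; the magnitude alone does not.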

The first method we use to compare the relative importance of the feature variables is Logistic Regression. We will consider the regression coefficient values for all our continuous variables. Scikit-learn does not report significance measures (such as p-values) for logistic regression coefficients. We therefore make use of the statsmodels library's implementation. It has no option for a univariate test, so we will first perform a multivariate analysis. We will thereafter make use of the mlxtend library to perform a univariate Logistic Regression test.

Optimization terminated successfully.
         Current function value: 0.509602
         Iterations 6

From this analysis it can be seen that the only variables of significance are max_heart_rate and st_depression. The remainder of the variables will be rejected based on their coefficient sizes.

Now we perform a univariate comparison between all the features. We use the mlxtend library for this.
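
The idea of a univariate comparison can be sketched without mlxtend by fitting a one-feature logistic regression per variable and comparing cross-validated accuracy. Note that this swaps in scikit-learn's `cross_val_score` for the mlxtend routine used in the analysis, and reports the spread across folds rather than a formal confidence interval; the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 303
y = rng.integers(0, 2, n)

# Synthetic features: one informative, one pure noise, by construction.
features = {
    "st_depression": 1.5 * y + rng.normal(0, 1, n),
    "cholesterol": rng.normal(246, 52, n),
}

univariate_accuracy = {}
for name, values in features.items():
    model = make_pipeline(StandardScaler(), LogisticRegression())
    # Each model sees a single column, hence "univariate".
    scores = cross_val_score(
        model, values.reshape(-1, 1), y, cv=5, scoring="accuracy"
    )
    univariate_accuracy[name] = (scores.mean(), scores.std())
    print(f"{name:>15}: accuracy={scores.mean():.2f} +/- {scores.std():.2f}")
```

A feature whose one-variable model scores well with a small spread across folds is a strong stand-alone predictor, although this says nothing about how it interacts with the other variables.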


From this analysis we can see that there are a large number of very strong predictors in this set of variables. thallium_scint scores 77% for accuracy and has the smallest confidence interval. exer_ind_angina and num_major_vessels similarly have high accuracy and small confidence intervals. chest_pain_type and max_heart_rate also have very high accuracy scores.

Next, we make use of scikit-learn's native feature selection methods, which also allow for univariate tests. The univariate ANOVA test for continuous variables, implemented as the f_classif scoring function and used with SelectKBest, will be applied. Let's see what the results are.
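
As a sketch of this step, the snippet below runs `SelectKBest` with `f_classif` on three synthetic continuous columns. The effect sizes are invented so that max_heart_rate separates the classes most clearly; `k=2` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(6)
n = 303
y = rng.integers(0, 2, n)

names = ["age", "max_heart_rate", "cholesterol"]
X = np.column_stack([
    54 + 3 * y + rng.normal(0, 9, n),      # weak class difference
    160 - 15 * y + rng.normal(0, 20, n),   # clear class difference
    rng.normal(246, 52, n),                # no class difference
])

# f_classif computes a one-way ANOVA F-statistic per feature.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

for name, score, p_value in zip(names, selector.scores_, selector.pvalues_):
    print(f"{name:>15}: F={score:7.2f}  p={p_value:.4f}")

selected = [name for name, keep in zip(names, selector.get_support()) if keep]
print("selected:", selected)
```

The F-statistic compares the between-class difference in means with the within-class variance, so it is suited to continuous predictors against a categorical outcome.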


We will now consider the categorical variables. Let's see what the results are.


num_major_vessels, thallium_scint and exer_ind_angina are all extremely strong predictors. chest_pain_type, st_slope and sex also contribute to the overall classification. From this analysis the only non-significant variables are rest_ecg and fasting_blood_sugar.

We have now analysed continuous and categorical data separately from a statistical perspective. Before we make the final decision on which variables to drop, we will consider an ML technique for deriving feature importance, i.e. Decision Trees and Random Forests. Unlike the case of regression, we can analyse and draw conclusions on continuous and categorical data together when using these algorithms, as they are impervious to differences in variable type. Another nice feature of trees is that we do not have to standardise or normalise features, which makes visual analysis a lot more intuitive. We therefore use our initial untransformed dataset for this analysis.
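
The sketch below shows how impurity-based feature importances could be read from a single tree and from a forest in scikit-learn. The three synthetic columns (one continuous, one binary stand-in for a categorical test result, one binary noise column) and the hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 303
y = rng.integers(0, 2, n)

names = ["max_heart_rate", "thallium_scint", "fasting_blood_sugar"]
X = np.column_stack([
    160 - 15 * y + rng.normal(0, 20, n),       # continuous, left unscaled
    np.where(rng.random(n) < 0.75, y, 1 - y),  # agrees with outcome 75% of the time
    rng.integers(0, 2, n),                     # noise
])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

for name, tree_imp, forest_imp in zip(
    names, tree.feature_importances_, forest.feature_importances_
):
    print(f"{name:>20}: tree={tree_imp:.2f}  forest={forest_imp:.2f}")
```

A shallow single tree can assign zero importance to features it never splits on, whereas a forest averages over many trees and usually gives a smoother ranking.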

AUC: 0.70
Normalized confusion matrix

The model has an AUC of about 0.70 (its ROC curve is flatter than those of the models thus far) and the feature importance results are not very convincing, seeing as many values are missing. This model needs a bit more work. It is interesting to note that Thallium Scintigraphy comes out very strongly even in this sub-optimal model. We will next look at Random Forests to see if we can improve on the single tree's accuracy.

AUC: 0.77
Normalized confusion matrix

The Random Forest plot is interesting to analyse. Visually one can observe that CAD (blue nodes) is evenly spread throughout the leaf nodes of the entire tree. A large proportion of the early CAD nodes occur for individuals with maximum heart rate <150 and cholesterol >210. From here, if ST depression >0.8 and the individual is male, around 20% of the overall population is classified as having CAD. Likewise, a large proportion of the population with max heart rate >150 and chest pain <3.5 is classified as not having CAD (orange nodes). Another interesting factor is that Thallium Scintigraphy is reported as the second most important feature. It does however not feature strongly in the Decision Tree. It is likely that strong cross-correlation with other strong features, such as maximum heart rate, causes the Thallium feature to surface only as a confirmatory feature at lower levels in the tree.

We now build our final Logistic Regression model with the variables selected.

AUC: 0.87
Normalized confusion matrix

The accuracy results indicate that even though 5 variables were dropped, the model accuracy did not reduce by a significant amount. We can therefore confidently deploy this model with the knowledge that it is both robust and accurate.

Compare Logistic regression with Multi-Layer Perceptron (MLP)

We can now build a Multi-Layer Perceptron to compare with the Logistic Regression.

Accuracy before model optimisation.

Accuracy Score: 0.83

We now optimise the NN architecture.


Now we optimise the neural network regularisation parameter.


The highest cross-validation accuracy score and hence the value to use for the alpha parameter is as follows.
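
A search over the regularisation parameter of this kind might look like the sketch below, which uses scikit-learn's `GridSearchCV` over the `alpha` parameter of `MLPClassifier`. The synthetic data, the single hidden layer of 8 units, the three candidate values and the 3-fold split are all assumptions, not the settings used in this analysis.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
n = 303

# Synthetic data: two informative columns, two noise columns.
X = rng.normal(size=(n, 4))
linear = 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-linear))).astype(int)

pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)

# alpha is the L2 penalty strength of MLPClassifier.
alphas = [1e-4, 1e-2, 1.0]
search = GridSearchCV(
    pipeline, {"mlpclassifier__alpha": alphas}, cv=3, scoring="accuracy"
)
search.fit(X, y)

best_alpha = search.best_params_["mlpclassifier__alpha"]
print(f"best alpha: {best_alpha}")
print(f"cross-validated accuracy: {search.best_score_:.2f}")
```

With a sample this small, the cross-validated scores for neighbouring alpha values are often within noise of each other, so the selected value should be read as indicative rather than definitive.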


Accuracy after regularisation

Accuracy Score: 0.83

Section 5: Analysis of results

Plot response curves


Our model is accurate enough to capture the directly proportional relationship between several predictor variables (in order of strength of association, based on response curve output):

  • thallium_scint
  • num_major_vessels
  • st_slope
  • st_depression
  • exer_ind_angina
  • chest_pain_type
  • sex

and the inversely proportional relationship between:

  • max_heart_rate

and the outcome of confirmed Coronary Artery Disease. This is a positive outcome, as it means the model as applied to the validation dataset managed to capture the underlying signals in the data. We can therefore conclude that the model generalises well and that its accuracy is sufficiently high for this model to be used based on the features captured.
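
One simple way to produce a response curve is to vary a single feature across its observed range while holding the others at their mean, and record the predicted probability. The sketch below does this for a logistic regression fitted on synthetic data; the helper name `response_curve` and the data-generating coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 303

st_depression = np.abs(rng.normal(1.0, 1.1, n))
max_heart_rate = rng.normal(150, 23, n)
X = np.column_stack([st_depression, max_heart_rate])

# Synthetic outcome: risk rises with ST depression, falls with max heart rate.
linear = 0.9 * (st_depression - 1.0) - 0.04 * (max_heart_rate - 150)
y = (rng.random(n) < 1 / (1 + np.exp(-linear))).astype(int)
model = LogisticRegression(max_iter=10_000).fit(X, y)


def response_curve(model, X, feature_index, n_points=20):
    """Predicted probability as one feature sweeps its observed range,
    with every other feature fixed at its mean."""
    grid = np.linspace(X[:, feature_index].min(), X[:, feature_index].max(), n_points)
    probe = np.tile(X.mean(axis=0), (n_points, 1))
    probe[:, feature_index] = grid
    return grid, model.predict_proba(probe)[:, 1]


st_grid, st_curve = response_curve(model, X, 0)
hr_grid, hr_curve = response_curve(model, X, 1)
print("st_depression curve:", st_curve[[0, -1]].round(2))
print("max_heart_rate curve:", hr_curve[[0, -1]].round(2))
```

A curve that rises with the feature indicates a direct relationship with the predicted risk, and a falling curve an inverse one; for a logistic model the curves are S-shaped rather than strictly proportional.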

This makes sense if one takes into account that the first two variables:

  • thallium_scint: Arteries found to be: 1. Normal 2. Reversible defect and 3. Fixed defect
  • num_major_vessels: Number of major vessels (0-3) coloured by fluoroscopy

are by nature close to the definition of Coronary Artery Disease itself.

Accuracy analysis

AUC: 0.82
Normalized confusion matrix

Conclusion

Question 1

  • Can data science be used to improve the diagnosis of Coronary Artery Disease by means of using existing data sources?

Given the confidence in the Gold Standard (Angiography) and the consequences of an incorrect diagnosis, it is in my view unlikely that a test with a sensitivity of approximately 90% or less will be considered as a replacement for Angiography. That is the level of accuracy we managed to attain using data available prior to Angiography and a variety of ML approaches.

Question 2

  • Can data science be used to reduce the number of Angiograms performed in settings where this is problematic?

This analysis identified the 8 most important features to consider which are: thallium_scint, num_major_vessels, st_slope, st_depression, max_heart_rate, exer_ind_angina, chest_pain_type and sex.

An understanding of the factors contributing to a positive Angiogram test would assist clinicians in deciding when an Angiogram might be indicated and what the likely outcome would be. This could assist in early intervention, workup and planning.

The Decision Tree provides useful information as a starting point for a discussion on an algorithm to decide whether an Angiogram is indicated for a particular patient. Further analytic work to assist with such a discussion could be to investigate cut-off points for different age/ sex groups or for populations with different prevalence of disease.

Question 3

The variables thallium_scint, exer_ind_angina, st_depression and max_heart_rate have high accuracy and small confidence intervals when their association with CAD is measured using a univariate logistic regression. This is arguably the most accurate method for measuring association in this setting for both continuous and categorical data. These variables also feature highly in the SelectKBest and Random Forest variable importance tests and can therefore be considered the most important factors in determining CAD.

Conclusion

Overall, this analysis has provided valuable insights into the use of data to assist medical practitioners with clinical decisions. Given the similar levels of accuracy that the Logistic Regression and MLP models attained, it will be up to clinical decision makers to decide on the utility of these approaches for predicting CAD without performing an Angiogram. In my view, however, it is more likely that predicting CAD in this manner is not feasible, and that one could focus on using insights into the association of predictors to develop a clinical algorithm that reduces the number of Angiograms performed in a clinical setting.

References

  • Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310.
  • David W. Aha & Dennis Kibler. "Instance-based prediction of heart-disease presence with the Cleveland database."
  • Heart Disease Dataset as recorded by: V.A. Medical Center, Long Beach and Cleveland Clinic Foundation, Principal investigator & data collector: Robert Detrano, M.D., Ph.D.